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We describe two improvements to Chait in-style graph coloring register allocators. The first, opti- 
mistic coloring, uses a stronger heuristic to find a fc-coloring for the interference graph. The second 
extends Chaitin’s treatment of rematerialization to handle a larger class of values. These tech- 
niques are complementary. Optimistic coloring decreases the number of procedures that require 
spill code and reduces the amount of spill code when spilling is unavoidable. Rematerialization 
lowers the cost of spilling some values. 

This paper describes both the techniques themselves and our experience building and using 
register allocators that incorporate them. It provides a detailed description of optimistic coloring 
and rematerialization. It presents experimental data to show the performance of several versions 
of the register allocator on a suite of FORTRAN programs. It discusses several insights that we 
discovered only after repeated implementation of these allocators. 

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors- — compilers, 
optimization 

General terms: Languages 
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1. INTRODUCTION 

The relationship between run-time performance and effective use of a machine’s 
register set is well understood. In a compiler, the process of deciding which values 
to keep in registers at each point in the generated code is called register allocation. 
Values in registers can be accessed more quickly than values in memory - on high- 
performance, microprocessor-based machines, the difference in access time can be 
an order of magnitude. Thus, register allocation has a strong impact on the run- 
time performance of the code that a compiler generates. Because relative memory 
latencies are rising while register latencies are not, the impact of allocation on 
performance is increasing. In addition, features like superscalar instruction issue 
increase a program’s absolute demand for registers - if the machine issues two 
instructions in a single cycle, it must have two sets of operands ready and in place 
at the start of the cycle. This naturally increases the demand for registers. 

Popular techniques for performing register allocation are based on a graph color- 
ing paradigm. These allocators construct a graph representing the constraints that 
the allocator must preserve. Using graph coloring techniques, they discover a map- 
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ping from values in the procedure to registers in the target machine; the map- 
ping must observe the constraints. The first graph coloring allocator was built by 
Chaitin and his colleagues [8]. Another approach, called priority-based coloring, 
was described by Chow and Hennessy [10, 11]. These two techniques have formed 
the core around which a rich literature has emerged (see Section 7). 

The techniques used in building graph coloring allocators can be improved. In 
recent years, several important extensions to Chaitin’s basic techniques have ap- 
peared [2, 30]. Nevertheless, problems remain. In practice, most of these problems 
appear as either over-spilling or a poor spill choice. In the former case, the allocator 
fails to keep some value in a register, even though a register is available throughout 
its lifetime. In the latter case, the allocator chooses the “wrong” value to keep in a 
register at some point in the code. 

This paper presents two improvements to existing techniques for register alloca- 
tion via graph coloring. The next section provides necessary background, describing 
Chaitin’s allocator. The following two sections provide a detailed description of op- 
timistic coloring and rematerialization, two improvements to Chaitin’s approach. 
Section 5 presents experimental data to show the performance of several versions of 
the register allocator on a suite of FORTRAN programs. Section 6 discusses several 
insights that we discovered only after repeated implementation of these allocators. 
Finally, section 7 presents a discussion of related work. 

2. REGISTER ALLOCATION VIA GRAPH COLORING 

The notion of abstracting storage allocation problems to graph coloring dates from 
the early 1960’s [28] (see Section 7). The first implementation of a graph coloring 
register allocator was done by Chaitin and his colleagues in the PL. 8 compiler [8, 6[. 
Chow and Hennessy later described a priority-based scheme for allocation based on 
a coloring paradigm [10, 11]. Almost all subsequent work on coloring-based allo- 
cation has followed from one of these two papers. Our own work follows Chaitin’s 
scheme. 

Any discussion of register allocation will contain several implicit assumptions. 
For our work, we assume that the allocator works on low-level intermediate code 
or assembly code. The code has been shaped by an optimization phase. Before 
allocation, the code can reference an unlimited number of registers. We call these 
“pre-allocation” registers virtual registers. The allocator does not work with the 
virtual registers; instead, it works with the distinct live ranges in a procedure. 
A single virtual register can have several distinct values that are live in different 
parts of the program - each of these values will become a separate live range. 
The allocator discovers all the separate live ranges and allocates them to physical 
registers on the target machine. 

To model register allocation as a graph coloring problem, the compiler first con- 
structs an interference graph G. The nodes in G correspond to live ranges and the 
edges represent interferences. Thus, there is an edge in G from node i (live range 
l t ) to node j if and only if Z, interferes with lj ; that is, they are simultaneously live 
at some point and cannot occupy the same register. 1 The live ranges that interfere 
with a particular live range h are called neighbors of k in the graph; the number 
of neighbors is the degree - denoted 1°. 


1 See Chaitin et a 1. for a complete discussion of interference [8]. 
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Fig. 1. Chaitin’s Allocator 


An attractive feature of Chaitin’s approach is that machine-specific constraints on 
register use can be represented directly in the graph [8] . Thus, the graph represents 
both the constraints embodied in the program and those presented by the target 
architecture in a single, unified structure. This is one of the key insights underlying 
graph coloring allocators: the interference graph represents all the constraints. 

To find an allocation from G, the compiler looks for a fc-coloring of G; that is, 
an assignment of k colors to the nodes of G such that adjacent nodes always have 
distinct colors. If we choose k to match the number of machine registers, then we 
can map a fc-coloring of G into a feasible register assignment for the underlying 
code. Because graph coloring is NP-complete [21], the compiler uses a heuristic 
method to search for a coloring; it is not guaranteed to find a fc-coloring for all 
fc-colorable graphs. If a fc-coloring is not discovered, some values are spilled; that 
is, the values are kept in memory rather than in registers. 

Spilling one or more live ranges creates a new and different interference graph. 
The compiler proceeds by iteratively spilling some live ranges and attempting to 
color the resulting new graph. In practice, a Chaitin-style allocator rarely requires 
more than three trips through this loop. Figure 1 illustrates a Chaitin-style alloca- 
tor. It proceeds in seven phases. 

(1) Renumber systematically renames live ranges. It creates a new live range for 
each definition point. At each use point, it unions together the live ranges that 
reach the use. Our implementation models this as an example of the classical 
disjoint set union-find problem. (In the papers on the PL. 8 compiler, this 
analysis is called “getting the right number of names” [8] . The Hewlett-Packard 
Precision Architecture compiler papers refer to this as “web analysis” [25].) 

(2) Build constructs the interference graph. Our implementation closely follows 
the published descriptions of the PL. 8 allocator [8, 6]. The interference graph 
is simultaneously represented as a bit-matrix and as a collection of adjacency 
lists. 

(3) Coalesce attempts to shrink the number of live ranges. Two live ranges l t 
and lj are combined if the initial definition of lj is a copy from li and they 
do not otherwise interfere. Combining the two live ranges eliminates the copy 
instruction. We denote the new live range lij. 

When the allocator combines li and lj, it can construct an imprecise but con- 
servative approximation to the set of interferences for lij. The conservative 
update lets the allocator batch together many combining steps. It performs all 
the coalescing possible with the update, then repeats both build and coalesce 
if coalescing has changed the graph. 

ACM Transactions on Programming Languages and Systems, Vol. 16, No. 3, May 1994, pp. 428—455. 









4 


P. Briggs, K. D. Cooper, and L. Torczon 



simple graph diamond graph 

Fig. 2. Example Graphs 


(4) Spill costs estimates, for each live range, the run-time cost of any instructions 
that would be added if the item were spilled. This cost is estimated by com- 
puting the number of loads and stores that would be required to spill the live 
range, with each operation weighted by c x 10 d , where c is the operation’s cost 
on the target architecture and d is the instruction’s loop-nesting depth. 

(5) Simplify constructs an ordering of the nodes. It creates an empty stack, then 
repeats the following two steps until the graph is empty: 

(a) If there exists a node l t with 1° < k, remove Z, and all of its edges from the 
graph. Place l t on the stack for coloring. 

(b) Otherwise, choose a node Z, to spill. Remove Z, and all of its edges from 
the graph. Mark l, to be spilled. 

After this, if any node is marked for spilling, the allocator inserts spill code 
(see 7 below) and repeats the allocation process. If no spilling is required, it 
proceeds to select (see 6 below). 

(6) Spill code is invoked if simplify decides to spill a node. Each spilled live range 
is converted into a collection of tiny live ranges by inserting loads before uses 
and stores after definitions, as required. 

(7) Select assigns colors to the nodes of the graph in the order determined by 
simplify. It repeats the following steps until the stack is empty: 

(a) pop a live range from the stack, 

(b) insert its corresponding node into G, and 

(c) give it a color distinct from its neighbors. 

To understand why this works, consider the actions of simplify and select. Simplify 
only moves li from the graph to the stack if 1° < k. Any live range that meets this 
condition is trivially colorable - that is, it will receive a color independent of the 
colors assigned to its neighbors. Thus, simplify only removes a node when it can 
prove that the node will get assigned a color. As each live range is removed, the 
degrees of its neighbors are lowered. This, in turn, may prove that they can be 
assigned colors. 

Select assigns colors to the nodes in reverse order of removal. Thus, it colors each 
node in a graph where it is trivially colorable; simplify ordered the stack so that 
this must be true. In one sense, the ordering colors the most constrained nodes first 
- li gets colored before lj precisely because simplify proved that lj was colorable 
independent of the specific color chosen for Z^. 
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As an example, consider finding a three-coloring for the “simple graph” shown 
in Figure 2. First, simplify removes all the nodes; it does not need to spill any 
of them. One possible sequence of removals is a, c, b, d, e. Next, select reinserts 
nodes into the graph, assigning colors. Node e is inserted first and can be given any 
color, say red. Next, d is added; it can get any color except red. Running through 
the entire stack might result in the assignment: e <— red, d <— blue, b <— green, 
c <— red, and a <— blue. 

The decision to spill a live range is made in simplify. When the allocator cannot 
find a node that is trivially colorable, it selects a node to spill. The metric for 
picking spill candidates is important. Chaitin suggests choosing the node with the 
smallest ratio of spill cost divided by current degree of the node [6] . 

Chaitin’s heuristic is not guaranteed to find the minimal coloring nor can it be 
guaranteed to find a fc-coloring if it exists; after all, graph coloring is NP-complete. 
For example, suppose we want to find a two-coloring of the “diamond graph” shown 
in Figure 2. Clearly, one exists; for example w <— red, x <— blue, y <— blue, and 
z <— red. Applying simplify to the diamond graph presents an immediate problem 
because there are no nodes with degree less than two. Thus, some node is selected 
for spilling. If all spill costs are equal, the allocator will make an arbitrary choice; 
for example, x. After x is removed from the graph and marked for spilling, simplify 
will remove the remaining three nodes without further spilling. Since a node was 
marked to spill, the allocator must insert spill code, rebuild the interference graph, 
and try again. 

Even though examples like the diamond graph exist, Chaitin’s technique produces 
good allocations in practice. Several factors contribute to its success. Allocation is 
based on global information in the form of a precise interference graph. It includes 
a powerful mechanism to remove unneeded copies - coalescing. Finally, it uses 
spill costs to guide the generation of spill code; those spill costs encode simple 
information about the control-flow graph. Any improvements to Chaitin’s work 
should retain these properties. 

3. OPTIMISTIC COLORING 

As part of the ParaScope programming environment [12], we built an optimizing 
compiler for FORTRAN running on uniprocessors. The initial implementation in- 
cluded a register allocator that used Chaitin’s technique as described in Section 2. 
The allocator worked well and seemed to produce satisfactory allocations. It re- 
quired modest amounts of time and space at compile time. However, as we debugged 
other parts of the compiler, we discovered several cases where it produced obviously 
flawed allocations. 

3.1 A Motivating Problem 

A particularly interesting case arose in the code generated for the singular value 
decomposition (SVD) of Golub and Reinsch [22]. The actual code was from the 
software library distributed with Forsythe, Malcolm, and Moler’s book on numerical 
methods [18]. It has two hundred fourteen lines of code, excluding comments. The 
code contains thirty-seven DO-loops organized into five different loop nests. The 
first loop nest is a simple array copy; four larger and more complex loop nests 
follow. Figure 3 shows its structure. 
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subroutine SVD(M, N, ...) 
do I = 1, N 
do J = 1, M 

A(I, J) = B (I , J) 
enddo 
enddo 

do 

many deeply-nested loops 

enddo 

do 

enddo 
do 

enddo 
do 

enddo 


I I I J 

M N I 


many long live ranges 


Fig. 3. The Structure of SVD 


In SVD, the allocator spilled many short live ranges in deference to the longer, 
more expensive live ranges. In the array-copy loops, it spilled the loop indices and 
limits even though several registers were unused in the loop. After some study, we 
understood both why the register allocator over-spilled so badly and what situations 
provoked this behavior. 

After optimization, about a dozen long live ranges extend from the initialization 
portion of the code down into the large loop nests. As the figure shows, the long 
live ranges span the small array-copy loops. Because they span so much of the 
code, they all have high degree in the interference graph. Additionally, they have 
large spill costs because they are referenced often inside deeply nested loops. They 
restrict the graph so much that the allocator must spill some live ranges. 

Initially, the allocator chooses to spill the indices and limits on the array-copy 
loops. This choice is correct. Because these values have smaller estimated spill costs 
than the longer live ranges, the allocator should choose them first. Unfortunately, 
spilling them does not help; the problem is in the later loop nests. To alleviate the 
demand for registers in the large loop nests, the allocator must spill more values. 
As it proceeds, it eventually spills most of the longer live ranges. As a result, there 
are unused registers in the array-copy loops (and the indices and limits are kept in 
memory) . 

3.2 An Improved Coloring Heuristic 

Knowing that the allocator over-spilled on both the diamond graph and the SVD, 
we reconsidered the allocation process. Each of the examples highlights a different 
problem. 

(1) The diamond graph is two-colorable; we can see that by inspection. The allo- 
cator fails because the approximation that it uses to decide whether x will get 
a color is too weak. 
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In looking for a fc-coloring, the allocator approximates “x gets a color” by “x 
has degree less than k.” This is a sufficient condition for x to get a color but not 
a necessary condition. For example, x may have k neighbors, but two of those 
neighbors may get assigned the same color. This is precisely what happens in 
the diamond graph. 

(2) In SVD, the allocator must spill some live ranges. The heuristic for picking a 
spill candidate selects the small live ranges used in shallow loop nests because 
they are less expensive to spill. Unfortunately, spilling them is not productive 
- it does not alleviate register pressure in the major loop nests. 

At the time that the allocator makes the decisions, it cannot recognize that the 
spills will not help. Similarly, the allocator cannot retract the decisions later. 
Once spilled, a live range stays spilled. 

The coloring heuristic explored by Matula and Beck finds a two-coloring for the 
diamond graph [29]. Their algorithm differs only slightly from Chaitin’s approach. 
To simplify the graph, they repeatedly remove the node of smallest current degree, 
versus Chaitin’s approach of removing any node n where n° < k. After all nodes 
have been removed, they select colors in the reverse of the order of deletion, in the 
same fashion as Chaitin. 

On the diamond graph, this heuristic generates a two-coloring. Chaitin’s heuristic 
fails because it pessimistically assumes that all the neighbors of a node will get 
different colors. Matula and Beck’s heuristic discovers a two coloring because it 
can capitalize on the case when two neighbors receive the same color. 

Unfortunately, Matula and Beck’s scheme simply finds a coloring; there is no 
notion of finding a fc-coloring for some particular k and no mechanism for producing 
spill code. In real programs, this is a serious problem. Many procedures require 
spill code - their interference graphs are simply not fc-colorable. For example, the 
SVD routine must spill some live ranges; an optimal coloring would not eliminate 
all spills. 

We wanted an algorithm that combined Matula and Beck’s stronger coloring 
heuristic with Chaitin’s mechanism for cost-guided spill selection. To achieve this 
effect, we made two modifications to Chaitin’s original algorithm: 

(1) In simplify, the allocator removes nodes that have degree less than k in arbitrary 
order. Whenever it discovers that all the remaining nodes have degree greater 
than k, it chooses a spill candidate. That node is removed from the graph; 
but instead of marking it for spilling, simplify optimistically pushes it on the 
stack, hoping that a color will be available in spite of its high degree. Thus, 
it removes nodes in the same order as Chaitin’s allocator, but spill candidates 
are included on the stack for possible coloring. 

(2) In select, the allocator may discover that no color is available for some node. 
In that case, it leaves the node uncolored and continues with the next node. 
Any uncolored node must be a node that Chaitin’s method would spill. To see 
this, consider the case where a node n was removed from a graph G m yielding 
a new graph G m+1 . In both methods, n is inserted into G m+1 , recovering 
G m . Chaitin’s method guarantees that a color is available for n in G m . Our 
method guarantees this property for all nodes except spill candidates. Thus, an 
uncolored node must be a spill candidate, that is, a node that Chaitin would 
have spilled. 
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Fig. 4. The Optimistic Allocator 


If all nodes receive colors, the allocation has succeeded. If any nodes are uncol- 
ored, the allocator inserts spill code for the corresponding live ranges, rebuilds 
the interference graph, and tries again. 

The resulting allocator is shown in Figure 4. We call it an optimistic allocator. The 
decision to insert spill code now occurs in select, rather than in simplify. The rest 
of the allocator is unchanged from Chaitin’s scheme. In this form, the allocator can 
handle both of the problems described at the head of this section. 

Deferring the spill decision has two powerful consequences. First, it eliminates 
some non-productive spills. In Chaitin’s scheme, spill decisions are made during 
simplify, before any nodes are assigned colors. When it selects a node as a spill 
candidate, that live range is spilled. In our scheme, spill candidates get placed on 
the stack with all the other nodes. Only when select discovers that no color is 
available is the live range actually spilled. This mechanism, in effect, allows the 
allocator to reconsider spill decisions. 

Second, late spilling capitalizes on details of the color assignment to provide a 
stronger coloring heuristic. In selecting a color for node x, it examines the colors of 
all x’s current neighbors. This provides a direct measure of “does x get a color?” 
rather than estimating the answer with “is x° < fc?” In particular, if two or more 
of x’s neighbors receive the same color, then x may receive a color even though 
x° > k. The optimistic allocator finds a two-coloring for the diamond graph. 

Recall SVD. The live ranges for I, J, M, and N are early spill candidates because 
their spill costs are small. However, spilling them does not alleviate register pressure 
inside the major loop nests. Thus, the allocator must spill some of the large live 
ranges; this happens after the small live ranges have been selected as spill candidates 
and placed on the stack. When the small live ranges come off the stack in select, 
some of these large live ranges have been spilled. The allocator can easily determine 
that colors are available for these small live ranges in the early array-copy loops. 

Optimistic coloring is a simple improvement to Chaitin’s pessimistic scheme. 
Assume that we have two allocators, one optimistic and one pessimistic, and that 
both use the same spill metric - for example, Chaitin’s metric of spill cost divided 
by current degree. The optimistic allocator has a stronger coloring heuristic in the 
following sense: it will color any graph that the pessimistic allocator does and it will 
color some graphs that the pessimistic allocator will not. If spilling is necessary, the 
optimistic allocator will spill a subset of the live ranges spilled by the pessimistic 
allocator. 

Optimistic coloring helps generate better allocations. In a few cases, this elimi- 
nates all spilling; the diamond graph is one such example. In many cases, the cost 
of spilling is reduced; that is, the procedure executes fewer cycles due to register 
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spilling. Section 5 presents dynamic measurements of the improvement. An imple- 
mentation of this technique in the back-end of the IBM XL compiler family for the 
RS/6000 architecture resulted in a decrease of about twenty percent in estimated 
spill costs over the SPEC benchmark suite [24]. 

4. REMATERIALIZATION 

Even with optimistic coloring, the allocator must spill some live ranges. When this 
happens, the allocator should choose the least expensive mechanism to accomplish 
the spill. In particular, it should recognize cases where it is cheaper to recompute 
the value than to store and retrieve it from memory. Consider the code fragments 
shown in Figure 5 (the notation [p\ means “the contents of the memory location 
addressed by p”). 

Source. Note that p is constant in the first loop, but varying in the second loop. 

The register allocator should take advantage of this situation. 

Ideal. Imagine that high demand for registers in the first loop forces p to be spilled; 
this column shows the desired result. In the upper loop, p is loaded just before 
it is needed, using a “load-immediate” instruction. For the lower loop, p is 
loaded just before the loop, again using a load-immediate. 

Chaitin. This column illustrates the code that would be produced by a Chaitin- 
style allocator. The entire live range of p has been spilled to memory, with 
loads inserted before uses and stores inserted after definitions. 

Splitting. The final column shows code we would expect from a “splitting” allo- 
cator [11, 27, 23, 5]; the actual code might be worse. In fact, our work on 
rematerialization was motivated by problems observed during our own experi- 
ments with live range splitting [3] . Unfortunately, examples of this sort are not 
discussed in the literature on splitting allocators and it is unclear how best to 
extend these techniques to achieve the Ideal solution. 

This section divides into two major subsections. The first presents a conceptual 
view of our approach to rematerialization. The second discusses the implementation 
of these ideas in our allocator. 

4.1 The Ideas 

Chaitin et a 1. discuss several ideas for improving the quality of spill code [8] . They 
point out that certain values can be recomputed in a single instruction and that 
the required operands will always be available for the computation. They call these 
exceptional values never-killed and note that such values should be recalculated 
instead of being spilled and reloaded. They further note that an uncoalesced copy 
of a never-killed value can be eliminated by recomputing it directly into the desired 
register [8]. Together, these techniques are termed rematerialization. In practice, 
opportunities for rematerialization include: 

— immediate loads of integer constants and, on some machines, floating-point con- 
stants, 

— computing a constant offset from the frame pointer or the static data-area pointer, 
— loads from a constant location in either the frame or the static data-area, and 
— loading non-local frame pointers from a display. 
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Ideal 


p <— Label 
y *-y + ]p\ 


p Label 



store p 


reload p 

y<-y + \p] 



Splitting 

1 

p < — Label 
store p 


reload p 

y <- y + [p] 


reload p 



\ 1 

[/ 1 

. * ' \ 

\ reload p 


l/ \ 

J p <- 

- p+ 1 

\ / 

p <— p + 1 
/ store p 

i \ / 

J p <- 

-p + 1 

\ / 


Fig. 5. Rematerialization versus Spilling 


The values must be cheaply computable from operands that are available through- 
out the procedure. 

It is important to understand the distinction between live ranges and values. A 
live range may comprise several values connected by common uses. In the Source 
column of Figure 5, p denotes a single live range composed from three values: the 
address Label, the result of the expression p+ 1, and (more subtly) the merge of 
those two values at the head of the second loop. 

Chaitin’s allocator correctly handles rematerialization when spilling a live range 
with a single value, but cannot handle more complex cases, like the variable p in 
Figure 5. We have extended Chaitin’s work to take advantage of rematerialization 
opportunities for complex, multi-valued live ranges. Our method tags each value 
with enough information to allow the allocator to handle it correctly. To achieve 
this, we 

(1) split each live range into its component values, 

(2) tag each value with rematerialization information, and 

(3) form new live ranges from connected values having identical tags. 

This approach allows correct rematerialization of multi-valued live ranges, but in- 
troduces a new problem: minimizing unnecessary splits. The following sections 
describe how to find values, how to propagate tags, how to split live ranges, and 
how to remove unproductive splits. 

4.1.1 Discovering Values. To find values, we construct the procedure’s static 
single assignment (SSA) graph, a representation that transforms the code so that 
each use of a value references exactly one definition [13]. To achieve this goal, 
the construction technique inserts special definitions called 0-nodes at those points 
where control-flow paths join and different values merge. We actually use the 
pruned SSA graph, with dead 0-nodes eliminated [9]. 
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A natural way to view the SSA graph for a procedure is as a collection of values, 
each composed of a single definition and one or more uses. Each value’s definition 
is either a single instruction or a </>-node that merges two or more values. By 
examining the defining instruction for each value, we can recognize never-killed 
values and propagate this information throughout the SSA graph. 

4.1.2 Propagating Rematerialization Tags. To propagate tags, we use an analog 
of Wegman and Zadeck’s sparse simple constant algorithm [33]. 2 We modify their 
lattice slightly to represent the necessary rematerialization information. The new 
lattice elements may have one of three types: 

T Top means that no information is known. A value defined by a copy instruction 
or a </>- node has an initial tag of T. 

inst. If a value is defined by an “appropriate” instruction ( never-killed ), it should 
be rematerialized. The value’s tag is simply a pointer to the instruction. 

_L Bottom means that the value must be spilled and restored. Any value defined 
by an “inappropriate” instruction is immediately tagged with _L. 

Additionally, their meet operation n is modified correspondingly. The new defini- 
tion is: 


any 

n 

T 

= any 


any 

n 

_L 

= _L 


insti 

n 

instj 

= insti 

if insti = instj 

insti 

n 

instj 

= _L 

if insti 7 ^ instj 


Note that insti = instj compares the instructions on an operand-by-operand basis. 
Since our instructions have at most 2 operands, this modification does not affect 
the asymptotic complexity of propagation. 

During propagation, each value will be tagged with a particular inst or _L. Values 
defined by a copy instruction will have their tags lowered to inst or _L, depending 
on the value that flows into the copy. Tags for values defined by </>-nodes will be 
lowered to inst if and only if all the values flowing into the node have equivalent 
inst tags; otherwise, they are lowered to _L. 

This process tags each value in the SSA graph with either an instruction or _L. 
If a value’s tag is _L, spilling that value requires a normal, heavyweight spill. If, 
however, its tag is an instruction, it can be rematerialized by issuing the instruction 
specified by the tag - a lightweight spill. The tags are used in two later phases of 
the allocator: spill costs uses the tags to compute more accurate spill costs and 
spill code uses the tags to emit the desired code. 

4.1.3 Inserting Splits. After propagation, the (jy nodes must be removed and val- 
ues renamed to recreate an executable program. Consider the example in Figure 6. 
The Source column simply repeats the example introduced in Figure 5. The SSA 
column shows the effect of inserting a <jy node for p and renaming the different val- 
ues comprising p’s live range. The Splits column illustrates the copies necessary 
to distinguish the different values without </>-nodes. The final column ( Minimal ) 
shows the single copy required to isolate the never-killed value po from the other 


2 The more powerful sparse conditional constant algorithm is unnecessary; earlier optimization has 
eliminated any control-flow paths that it would detect as non-executable. 
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Source 

1 

SSA 

| 

Splits 

1 

T 

p «— Label 

T 

po <— Label 

T 

po < — Label 


Minimal 

I 

po <— Label 




values comprising p. We avoid the extra copy by noting that pi and P 2 have iden- 
tical tags after propagation (both are _L) and may be treated together as a single 
live range p \2 ■ Similarly, two connected values with the same inst tag would be 
combined into a single live range. 

For the purposes of rematerialization, the copies are placed perfectly - the never- 
killed value has been isolated and no further copies have been introduced. 3 The 
algorithm for removing </>- nodes and inserting copies is described in Section 4.2.1. 

4.1.4 Removing Unproductive Splits. Our approach inserts the minimal number 
of copies required to isolate the never-killed values. Nevertheless, coloring can 
make some of these copies superfluous. Recall the Minimal column in Figure 6. If 
neither po nor pi 2 are spilled and both receive the same color, the copy connecting 
them is unnecessary. Because it has a real run-time cost, the copy should be 
eliminated whenever possible. We cannot simply use coalesce; it would remove 
all of the copies, losing the desired separation between values with different tags. 
Therefore, we use a pair of limited coalescing mechanisms to remove unproductive 
copies: conservative coalescing and biased coloring. Conservative coalescing is a 
straightforward modification of Chaitin’s coalesce phase. Conceptually, we add a 
single constraint to coalesce - only combine two live ranges if the resulting single live 
range will not be spilled. Biased coloring increases the likelihood that live ranges 
connected by a copy get assigned to the same register. Conceptually, select tries 
to assign the same color to two live ranges connected by a copy. Taken together, 
these two mechanisms remove most of the unproductive copies. 


3 Note that the allocator could insert a 11 the copies suggested in the Splits column as a form of 
live range splitting. We are currently exploring the problem of performing live range splitting in 
a Chaitin-style allocator. So far, our experimental results have been mixed [3]. 
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4.2 Implementing Rematerialization 

Chaitin-style allocators can be extended naturally to accommodate our approach. 
The high-level structure depicted in Figure 4 is unchanged, but several low-level 
modifications are required. The next sections discuss the enhancements required in 
renumber, coalesce, and select. 

4.2.1 Renumber. Chaitin’s version of renumber was based on def-use chains [8]. 
Long before our interest in rematerialization, we adopted an implementation strat- 
egy for renumber based on the pruned SSA graph. Conceptually, that implemen- 
tation has four steps: 

(1) Determine liveness at each basic block using a sparse data-flow evaluation 
graph [9]. 

(2) Insert 0-nodes based on dominance frontiers [13]. Avoid inserting dead 0-nodes. 

(3) Renumber the operands in every instruction to refer to values instead of the 
original virtual registers. At the same time, accumulate availability information 
for each block. The intersection of live and avail is needed at each block to 
allow construction of a precise interference graph [8] . 

(4) Form live ranges by unioning together all the values reaching each 0-node using 
a fast disjoint set union. The disjoint set structure is maintained while build- 
ing the interference graph and coalescing (where coalesces are further union 
operations) . 

In our implementation, steps 3 and 4 are performed during a single walk over the 
dominator tree. Using these techniques, renumber completely avoids the use of bit- 
vectored flow analysis. Despite the apparent complexity of the algorithms involved, 
it is very fast in practice and requires only a modest amount of space. 

Because renumber already uses the SSA graph, only modest changes are required 
to support rematerialization. The modified renumber has six steps: 

(1) Determine liveness at each basic block using a sparse data-flow evaluation 
graph. 

(2) Insert 0-nodes based on dominance frontiers, still avoiding insertion of dead 
^-nodes. 

(3) Renumber the operands in each instruction to refer to values. At the same 
time, initialize the rematerialization tags for all values. 

(4) Propagate tags using the sparse simple constant algorithm as modified in Sec- 
tion 4.1.2. 

(5) Examine each copy instruction. If the source and destination values have iden- 
tical in st tags, we can union them and remove the copy. 

(6) Examine the operands of each 0-node. If an operand value has the same tag 
as the result value, union the values; otherwise, insert a split (a distinguished 
copy instruction) connecting the values in the corresponding predecessor block. 

Steps 5 and 6 are performed in a single walk over the dominator tree. 

4.2.2 Conservative Coalescing. To prevent coalesce from removing the splits so 
carefully introduced in renumber, we must limit its power. Specifically, it should 
never coalesce a split instruction if the resulting live range may be spilled. In 
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normal coalescing, two live ranges Z, and lj are combined if lj is defined by a copy 
from li and they do not otherwise interfere. In conservative coalescing, we add 
an additional constraint: combine two live ranges connected by a split if and only 
if Uj has < k neighbors of significant degree, where significant degree means a 
degree > k. 

To understand why this restriction is safe (indeed, it is conservative), recall 
Chaitin’s coloring heuristic [6]. Before any spilling, nodes of degree < k are re- 
moved from the graph. When a node is removed, the degrees of its neighbors are 
reduced, perhaps allowing them to be removed. This process repeats until the 
graph is empty or all remaining nodes have degree > k. Therefore, for a node to 
be spilled, it must have at least k neighbors with degree > k in the initial graph. 

In practice, we perform two rounds of coalescing. Initially, all possible copies 
are coalesced (but not split instructions). The graph is rebuilt and coalescing is 
repeated until no more copies can be removed. Then, we begin conservatively 
coalescing split instructions. Again, we repeatedly build the interference graph and 
attempt further conservative coalescing until no more splits can be removed. 

In theory, we should not intermix conservative coalescing with unrestricted coa- 
lescing since the result of an unrestricted coalesce may be spilled. For example, li 
and lj might be conservatively coalesced, only to have a later coalesce of kj with Ik 
provoke the spilling of lijk (since lijk may have k or more neighbors of significant 
degree). In practice, this may not prove to be a problem; we have not measured 
this effect. If intermixing conservative coalescing with unrestricted coalescing does 
not produce worse allocations, it would simplify and speed the entire process. 

Conservative coalescing directly improves the allocation. Each coalesce removes 
an instruction from the resulting code - a split instruction that was introduced by 
the allocator. In regions where there is little competition for registers, conservative 
coalescing undoes all splitting. However, it cannot undo all the non-productive 
splits by itself. 

4.2.3 Biased Coloring. The second mechanism for removing useless splits in- 
volves changing the order in which colors are considered for assignment. As renum- 
ber inserts splits, it marks pairs of values connected by a split as partners. When 
select assigns a color to k, it first tries colors already assigned to one of Z;’ s part- 
ners. With a careful implementation, this is no more expensive than picking the 
first available color; it really amounts to biasing the spectrum of colors by previous 
assignments to Z;’ s partners. 

The biasing mechanism can combine live ranges that conservative coalescing can- 
not. For example, li might have 2k neighbors of significant degree, but these neigh- 
bors might not interfere with each other and thus might all be colored identically. 
Conservative coalescing cannot combine li with any of its partners; the resulting 
live range would have too many neighbors of significant degree. Biasing may be able 
to combine li and its partners because it is applied after the allocator has shown 
that both live ranges will receive colors. At that late point in allocation, combining 
them is a matter of choosing the right colors. By virtue of its late application, the 
biasing mechanism uses a detailed level of knowledge about the problem that is not 
available any earlier in the process - for example, when coalescing is performed. 

To increase the likelihood that biasing will match partners, we can add limited 
lookahead. When picking a color for li, if it has an uncolored partner lj, the 
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allocator can look for a color that is still available for lj . On average, li has a small 
number of partners; thus, we can add limited lookahead to biased coloring without 
increasing the asymptotic complexity of select. 

5. EMPIRICAL STUDIES 

We have implemented these techniques in our optimizing compiler for FORTRAN. 
The compiler has experimental code generators for the RT/PC, the Sparc, and the 
RS/6000. To experiment with register allocation, we have built a series of allocators 
that are independent of any particular architecture [3] . 

All of our experiments have involved Chaitin-style allocators; we have not im- 
plemented any other kind of register allocator (e.g., those described in Section 7). 
There are several reasons for this. First, each allocator requires a large implemen- 
tation effort. Second, the results would be suspect because we have insight into 
tuning Chaitin-style allocators, but no practical experience with other allocators. 
Finally, the results would present a clouded picture. Since good experiments require 
controlled changes, we try to change only one thing at a time. 

Our experimental allocators work with routines expressed in ILOC, a low-level 
intermediate language designed to allow extensive optimization. An ILOC routine 
that assumes an infinite register set is rewritten in terms of a particular target 
register set, with spill code added as necessary. The target register set is specified 
in a small table and may be varied to allow convenient experimentation with a wide 
variety of register sets. 

After allocation, each ILOC routine is translated into a complete C routine. Each 
C routine is compiled and the resulting object files are linked into a complete 
program. There are several advantages to this approach: 

— By inserting appropriate instrumentation during the translation to C, we can 
collect accurate, dynamic measurements. 

— Compilation to C allows us to test a single routine in the context of a complete 
program running with real data. 

— We can perform our tests in a machine-independent fashion, potentially using a 
variety of register sets. 

Simply timing actual machine code is inherently machine-dependent and tends 
to obscure the effects of allocation. During the translation into C, we can add 
instrumentation to count the number of times any specific ILOC instruction is 
executed. For comparing register allocators, we are interested in the number of 
loads, stores, copies, load-immediates, and add-immediates. 

Figure 7 shows a small sample of ILOC code and the corresponding C transla- 
tion. Usually there is a one-to-one mapping between the ILOC statements and the 
C translations, though some additional C is required for the function header and 
declarations of the “register” variables; e.g., rl4 and fl5. Also note the simple 
instrumentation appearing immediately after several of the statements. Of course, 
this sample is very simple, but the majority of ILOC is no more complex. 

5.1 The Target Machine 

For the tests reported here, our target machine is defined to have sixteen integer 
registers and sixteen floating-point registers. Each floating-point register can hold 
a double-precision value, so no distinction is made between single-precision and 
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LLE3 : 

nop 




LLE3 : 



LLA4: 

ldi 

rl4 

8 


LLA4 : 

+ 

+ 

•H 

00 

-P 

■H 

II 

P 



add 

r9 

rl5 

rll 


r9 = rl5 + rll; 



mvf 

f 15 

fO 



f 15 = fO; C++; 



be 

L023 




goto L023; 


L023 : 

lddrr 

f 14 

rl4 

r9 

L023 : 

fl4 = *( (double *) (rl4 + r9)); 

1 + + 


dabs 

f 14 

f 14 



f 14 = f abs(f 14) ; 



dadd 

f 15 

f 15 

f 14 


f 15 = f 15 + f 14; 



addi 

rl4 

rl4 

8 


r 14 = rl4 + (8) ; a++; 



sub 

r7 

rlO 

rl4 


r7 = rlO - rl4 ; 



br 

g e 

r7 

N6 N7 


if (r7 >= 0) goto N6; else goto 

N7 ; 


Fig. 7. ILOC and C 

double-precision values once they are held in registers. Up to four integer registers 
may be used to pass arguments (recall that arguments are passed by reference in 
FORTRAN; therefore, the argument registers hold pointers to the actual values); any 
remaining arguments are passed in the stack frame. Function results are returned 
in an integer or floating-point register, as appropriate. Ten of each register class 
are designated as callee-saves; the remaining six (including the argument registers) 
are not preserved by the callee. 

When reporting costs, we assume that each load and store requires two cycles; 
all other instructions are assumed to require one cycle. Of course, these are only 
simple approximations of the costs on any real machine. 

5.2 Measuring Spill Costs 

Since our instrumentation reports dynamic counts of all loads, stores, etc., we need 
a mechanism for isolating the instructions due to allocation. A difficulty is that 
some spills are profitable. In other cases, the allocator removes instructions; e.g., 
copy instructions. Therefore, we tested each routine on a hypothetical “huge” 
machine with 128 registers, assuming this would give a nearly perfect allocation. 
The difference between the “huge” results and the results for one of the allocators 
targeted to our “standard” machine should equal the number of cycles added by 
the allocator to cope with insufficient registers. 

5.3 Experimental Results 

Our test suite is a collection of seventy routines contained in eleven programs. 
Eleven routines are from Forsythe, Malcolm, and Moler’s book on numerical meth- 
ods [18]. They are grouped into seven programs with simple drivers. The remaining 
fifty-nine routines are from the SPEC benchmark suite [32] . Four of the SPEC pro- 
grams were used: doduc (41 routines), fpppp (12 routines), matrix300 (5 routines), 
and tomcatv (1 routine). The two other FORTRAN programs in the suite (spice 
and nasa7) require language extensions not yet supplied by our front-end. 

Our results are presented as a sequence of comparisons in two tables. The first 
two columns give the program and subroutine name. The third and fourth column 
give the observed spill costs for the two allocators being compared. These costs are 
calculated from dynamic counts of instructions as described earlier. The last column 
(total) gives the percentage improvement in spill costs from the old allocator to the 
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Table I. Effects of Optimistic Coloring 



Cycles of Spill Code 

Percentage Contribution 

program routine 

Original 

Optimistic 



M&fat H 


IJggJjrUI 


fmin fmin 

551 

370 


16 


0 


32 

seval spline 

125 

117 


2 




6 

solve decomp 

362 

305 


6 

0 



16 

svd svd 

2,509 

1,977 


7 

-2 



21 

doduc colbur 

25 

19 


8 


8 


24 

dcoera 

29 

15 


21 


V 


48 



335 


12 

-0 



24 


1,939 

1,131 

HKiiSfll 

20 




42 

debico 

463 

459 

ui 

0 




1 

deseco 

5,500 

4,957 


2 

-1 

3 

0 

10 

drepvi 

252 

218 


6 




14 

TEEtr 

Z5T~ 


IXSEfl 

6 




12 

inithx 

TlM 

579 

IHifl 

4 


4 


19 

integr 

526 

502 




5 


5 

lectur 

257 

221 


2 


2 


14 


1,780 

1,433 

kb 

6 



5 

20 


1,954 

1,531 

B£B 

9 



0 

21 


651 

599 

KB 

4 




8 

supp 

146 

149 

USB 

-1 


0 


-2 

fpPPP d2esp 

51 

35 

m 

16 

-2 

2 


31 

efill 

173 


K2B 

23 


2 


u 

tpppp 

1,472 

1,444 

u 

1 




2 

twldrv 

13,731,802 


K£B 

7 

0 



18 

matrix300 lbmkl4 

136 

132 

KB 

1 




3 

sgemm 

12,321 

9,905 

BU!B 

10 




20 

sgemv 

3,027 


SKSH 

0 

-0 



40 

tomcatv tomcatv 

394,397,732 

367,995,733 | 


3 

-0 



7 


new one - large positive numbers indicate significant improvements. The middle 
columns show the contribution of each instruction type to the total. 

All percentages have been rounded to the nearest integer. Insignificant improve- 
ments are reported as 0 and insignificant losses are reported as —0. In cases where 
the result is zero, we simply show a blank. Since results are rounded, a total entry 
may not equal the sum of the contributing entries. Each table shows only routines 
where a difference was observed. 4 

Consider the first row in Table I. This row presents results for the routine 
fmin from the program fmin. The allocator using Chaitin’s heuristic generated 
an allocation requiring 551 cycles of spill code; the optimistic allocator required 
only 370 cycles. 16% of the savings came from having to execute fewer loads and 
16% arose from fewer stores. There was a further insignificant improvement due to 
executing fewer load-immediates. The total improvement was 32%. 

5.3.1 Optimistic Coloring. Table I shows a comparison of test results for our 
allocator using two different coloring heuristics. The column labeled Original gives 
data for a version using Chaitin’s coloring heuristic; the column labeled Optimistic 
gives data for a version using our optimistic coloring heuristic. 

In our test suite of 70 routines, we measured improvements in 26 cases and a 
single loss (an extra load and store were required). Improvements ranged from 
tiny to quite large, sometimes reducing spill costs by over 40%. The single loss was 


4 The raw data for the tables is given in Appendix A of Briggs’ thesis [3]. 
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Table II. Effects of Rematerialization 



Cycles of Spill Code 

Percentage Contribution 

program 

routine 

Optimistic 

Remat. 







rkf'45 

fehl 

68 

50 

■SIS 


7 

-7 


27 

seval 

spline 

117 

102 

10 

2 

2 

-1 


13 

solve 

decomp 

305 

286 

4 

3 


-1 


6 

svd 

svd 

1,977 

1,966 

1 


0 

-0 


1 

zeroin 

zeroin 

236 

234 

2 


-1 



1 

doduc 

bilan 

1,046 

966 

5 

3 




8 


bilsla 

16 

15 




6 


6 


colbur 

19 

24 

-11 

-11 

-5 



-26 


ddefiu 

335 

375 

-5 

-7 

1 

1 


-12 


debico 

IW 

TUT" 

6 

0 

1 

2 


9 


deseco 

4,957 

4,636 

7 

2 


-2 

0 

7 


drepvi 

218 

175 

4 

14 

0 

2 


20 


drigl 

32 

31 




3 


3 


heat 

34 

31 

6 


1 



9 



500“ 

395 

1 

0 


-0 


1 



50 

48 




4 


4 


inisla 

31 

28 


6 


3 


10 


inithx 

579 


17 

10 


-2 


25 


integr 

502 

372 

18 

12 


-3 


26 



221 

166 



2 

23 


25 


ggzlliS] 

39 

35 


5 

-3 

8 


10 



1,433 

1,383 

8 

0 

-1 

-4 


4 


pastern 

289 

220 

im 

10 

13 

-19 


24 


prophy 

1,531 

1,525 


0 


0 


0 


repvid 

599 

50 T~ 

9 

13 

11 



33 

fpppp 

d2esp 

35 

34 

6 



-3 


3 


main 

210 

199 



0 

5 


5 


twldrv 

11,311,624 

11,198,058 

2 

0 


-1 


1 

matrix300 

sgemm 

9,905 


12 

6 


-3 


15 

tomcatv 

tomcatv 

367,995,733 

355,039,258 

4 

0 

-0 



4 


disappointing, since we have claimed that the optimistic coloring heuristic can never 
spill more than Chaitin’s heuristic. However, we must recall the structure of the 
allocator. After each attempt to color, spill code is inserted and the entire build- 
coalesce-color process is repeated. The optimistic coloring heuristic will perform 
at least as well as Chaitin’s heuristic on any graph; but after spilling, the two 
allocators will be facing different problems. 

5.3.2 Rematerialization. Table II summarizes the effect of our new approach to 
rematerialization. It compares two versions of the optimistic allocator that differ 
only in their handling of never-killed values. The column labeled Optimistic gives 
data for a version that uses Chaitin’s limited approach to rematerialization; the 
column labeled Remat. gives data for a version incorporating our new method. 

From the entire suite of 70 routines, we observed improvements in 28 cases and 
degradations in only 2 cases. One loss was very small (2 loads, 2 stores, and an 
extra copy); the other was somewhat larger. Improvements ranged from tiny to 
reasonably large, with many greater than 20%. Of course, adjusting the relative 
costs of each instruction, especially loads and stores, will change the amount of 
improvement. 

As expected, we see a pattern of fewer load instructions and more load-immediate 
instructions. Typically, the number of stores and the number of copies are also 
reduced. The reduced number of copy instructions suggests that our heuristics 
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for removing unhelpful splits are adequate in practice. Note that this reduction is 
obtained in spite of the extra copies introduced by renumber. 

5.4 Allocation Costs 

The improved allocations come at a cost in compile-time. In the case of the op- 
timistic allocator, the coloring phase can be more expensive due to the need to 
attempt a coloring even if spill candidates are discovered. Support for remateri- 
alization requires several extra steps. An extra pass over the code is required to 
initialize rematerialization tags before propagation and further time is required to 
propagate the tags throughout the routine. Finally, at least one extra pass is re- 
quired to accomplish conservative coalescing. On the other hand, the build-coalesce 
process may be slightly faster since we can eliminate some copies during renumber 
(recall step 5 in Section 4.2.1). 

Table III shows comparative timings for the three allocators on three routines 
from the SPEC suite. Times are given in seconds and were measured with a 
100 hertz clock on an unloaded IBM RS/6000 Model 540. Each run was repeated 
10 times and the results were averaged. The first column shows the phase of alloca- 
tion. Cfa. stands for control-flow analysis and includes the time required to compute 
forward and reverse dominators and dominance frontiers. Build includes the entire 
build-coalesce loop. Color includes both simplify and select. Note that tomcatv 
required more rounds of spilling than the other routines. For each routine, the 
Orig. column gives times required by the allocator with Chaitin’s original coloring 
heuristic, the Opt. column gives the times required by the optimistic allocator, and 
the Remat. column gives times required by the optimistic allocator with improved 
rematerialization. 

We selected three routines to illustrate performance over a range of sizes. The 
first routine is repvid, from the program doduc, with 144 non-comment lines of 
FORTRAN. It compiles to a .text size of 1284 bytes using IBM’s xlf compiler 
with full optimization. The second routine is tomcatv, with 133 lines and a .text 
size of 3064 bytes. The largest routine is twldrv from the program fpppp, with 
881 lines and a .text size of 15,616 bytes. All three routines appear in Table I. 

An obvious conclusion to draw from the data in Table III is that support for 
rematerialization can require a small amount of additional compile-time. On the 
other hand, the optimistic coloring heuristic has very little cost and is sometimes 
faster than Chaitin’s pessimistic heuristic. 

The results in Table III also illuminate a number of interesting details about the 
behavior of all three allocators. 

— The initial pass of the build-coalesce loop dominates the overall cost of allocation 
(as noted by Chaitin). In comparison, additional iterations of the color-spill 
process are quite inexpensive. 

— In each case, the cost of renumber is higher for Remat., reflecting the cost of 
propagating rematerialization tags. 

— In most cases, the cost of the build-coalesce loop is higher for Remat. due to the 
additional passes of conservative coalescing. 

— The very low cost of control-flow analysis illustrates the speed and practicality 
of the algorithm for calculating dominance frontiers [13]. 
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Table III. Allocation Times in Seconds 



repvid 

tomcatv 

twldrv 

Phase 

Orig. 

Opt. 

Remat. 

Orig. 

Opt. 

Remat. 

Orig. 

Opt. 

Remat. 

cfa 

.00 

.00 

.00 

.00 

.00 

.01 

.01 

.01 

.01 

renum 

.04 

.03 

.04 

.06 

.06 

.07 

.56 

.56 

.62 

build 

.16 

.16 

.16 

.38 

.38 

.42 

9.88 

9.84 

8.47 

costs 

.01 

.01 

.01 

.02 

.02 

.02 

.14 

.16 

.14 

color 

.01 

.02 

.03 

.03 

.04 

.06 

1.12 

1.08 

1.61 

spill 

.01 

.01 

.01 

.02 

.02 

.02 

.18 

.16 

.16 

renum 






.03 

.08 

.10 

.14 

build 

.06 

.06 

.06 

.09 

.09 

.12 

.62 

.61 

.81 

costs 

.01 

.01 

.01 

.01 

.01 

.01 

.07 

.07 

.06 

color 

.01 

.01 

.02 

.01 

.02 

.04 

.11 

.14 

.31 

spill 

.01 

.00 

.01 

.09 

.01 

.01 

.04 

.03 

.04 

renum 


.01 





.09 

.10 

.14 

build 

.04 

.04 

.06 

.05 

.05 

.09 

.40 

.59 

.80 

costs 

.01 

.01 

.01 

.01 

.01 

.01 

.07 

.07 

.06 

color 

.01 

.01 

.02 

.02 

.02 

.04 

.19 

.15 

.31 

spill 




.01 

.00 

.01 




renum 










build 




.06 

.05 

.09 




costs 




.01 

.01 

.01 




color 




.01 

.02 





spill 




.01 






renum 










build 




.05 






costs 




.01 






color 




.02 






total 

.40 

.39 

.47 

.97 

.85 

1.13 

13.55 

13.65 

13.67 


— The higher cost of coloring in the first pass arises from the cost of choosing nodes 
to spill. While the cost of coloring is linear in the size of the graph, spill selection 
is 0(s • n), where s is the number of spill choices and n is the number of nodes. 
With a large number of spills, this term dominates the cost of coloring. 

We are pleased with the overall speed of the allocators. Our results appear to be 
slightly faster than the times reported by IBM’s xlf compiler for register allocation 
and comparable to the times reported for optimization. In an extensive comparison 
with priority-based coloring, our allocators appeared much slower on very small 
routines, but much faster on very large routines [3] . Of course, these speeds are not 
competitive with the fast, local techniques used in non-optimizing compilers [19, 20]; 
however, we believe that global optimizations require global register allocation. 

6. IMPLEMENTATION INSIGHT 

To perform the experiments described in Section 5, we implemented several versions 
of our allocator. Naturally, we gained some insight into the implementation of 
Chaitin-style graph coloring allocators. This section attempts to convey those ideas 
in a concise form. We hope that it proves useful to other implementors. When 
possible, we label an insight with the particular phase of the allocator that it 
affects. 
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Constant values. The rematerialization scheme suggests that the compiler should 
represent constants in a way that makes them visible to the allocator. In our 
current compiler, integer constants are obvious in the code; they are loaded 
using an integer load immediate instruction. Non-integer constants are stored 
in a static constant pool and loaded using a fixed offset from the static data 
area pointer. This effectively hides them from rematerialization. Changing our 
intermediate representation to make these values visible would expose them to 
rematerialization, improving the quality of the final code. 

Build. Like Chaitin, we advocate building two representations of the interference 
graph: both a bit matrix and a set of adjacency vectors. Many implementations 
use a single representation and claim a space savings. However, both forms of 
the graph are needed for efficient execution of the allocator. 

In practice, we have not had problems with an explosion in the size of the 
interference graph. Our allocator runs in a reasonable amount of space. For 
example, the interference graph of tomcatv requires 146,404 bytes while the 
graph for twldrv requires 3,357,640 bytes [3]. Usually, the bit vectors for an 
iterative solution of the live variables problem are as large as our interference 
graphs. This fact led us to use the sparse evaluation graph techniques of Choi, 
Cytron, and Ferrante [9]. 

Coalescing. Coalescing is not transitive. Even though l t and lj do not interfere and 
li and Ik do not interfere, Uj will interfere with Ik if lj interferes with Ik- Thus, 
the order of coalescing is important. 

Coalescing should proceed from innermost loops outward. Inside-out coalescing 
ensures that copies in more deeply nested loops get removed before copies in 
outer loops. In practice, this makes a noticeable difference on many routines. 

Conservative coalescing and biased coloring. There is a direct trade-off between the 
strength of conservative coalescing and the effectiveness of biased coloring. In 
practice, we experimented with several different restrictions on coalescing. More 
precise (i.e., less conservative) restrictions on coalescing increase the set of live 
ranges that it can combine; this decreases the number of live ranges combined 
by biased coloring. It is unclear whether this occurs because the more precise 
coalescing condition restricts freedom in the graph or because it combines live 
ranges that biased coloring also will catch. In any case, biased coloring cannot 
completely eliminate the need for conservative coalescing. 

Computing spill costs. In our allocator, spill costs is always executed. Chaitin’s 
allocator waits until it must choose the first value to spill [7]. Each approach 
has its merits. 

Some live ranges have negative spill costs. This suggests that it is less expensive 
to store and reload them than to keep them in registers. Our allocator computes 
spill costs early and aggressively removes live ranges with negative spill cost. 
Chaitin defers the computation of spill costs until the allocator recognizes that 
it must spill. This speeds allocation on procedures that color without spilling. 
Additionally, the allocator can skip the spill-cost computation for live ranges 
that it has already removed from the graph. 

Simplify. Rematerialization fundamentally changes the allocator. Live ranges are 
unrelated in other Chaitin-style allocators. The order of removal for trivially 
colored nodes has little noticeable effect on the results of allocation. The in- 
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troduction of splits to isolate never-killed values and the use of biased coloring 
change that fact. This gives a new importance to the relative removal order of 
trivially colored nodes. When faced with multiple nodes that can be removed, 
select should first remove nodes that have no partners. This forces the nodes 
with partners to be colored first. They are colored in a less constrained graph. 
In practice, this increases the effectiveness of biased coloring. 

Picking spill candidates. Introducing splits in order to support rematerialization 
increases the number of nodes in the interference graph. This can adversely 
effect the running time of simplify by increasing the time spent searching for 
spill candidates. As described in Section 2, the search for a spill candidate must 
examine each node remaining in the current graph. A naive implementation 
repeats this process for each spill because the spill metric includes the node’s 
current degree. Thus, choosing spill candidates takes 0(k ■ N ) time, where k is 
the number of spills and N is the number of live ranges in the original graph. 
Undoubtedly, some careful algorithmic work can improve this situation. In 
practice, we refactored the comparison by noting that 

COSt{l x ) > COSt(lj ) .f an( _j on jy jf cost(li) x l ? > COSt(lj) x 1°. 

Replacing the floating-point divisions with integer multiplies improved the con- 
stant factor. 

Limited backtracking. Select assigns colors to nodes in a somewhat arbitrary man- 
ner. Biased coloring capitalizes on this fact by carefully choosing the order 
in some cases where it matters. Nonetheless, improved color choice can avoid 
some spills. In the optimistic allocator, we can add a limited form of backtrack- 
ing. When no color is available for node p, select can consider recoloring one 
of p’s neighbors to open up a color for p. Such backtracking must be carefully 
constrained to avoid a combinatorial explosion. 

While examining p’s neighbors to determine available colors, we can accumulate 
the number of uses of each color and note which neighbor uses each color. If no 
color remains for p and one of the colors is used by only one of p’s neighbors, 
select can try to recolor that neighbor. If it succeeds, a color is available for p. 
Limited backtracking is easy to implement. It takes very little compile time. 
It rarely leads to degradation. 5 On the other hand, it rarely leads to a sig- 
nificant improvement [3]. It also fits into biased coloring; however, experience 
suggests that limited lookahead (see Section 4.2.3) produces more consistent 
improvement than limited backtracking in the biased scheme. 

NP-noise. In testing, we have often stumbled on a phenomenon that we label NP- 
noise. When measuring the output of a process that involves heuristic solution 
of an NP-complete problem, the answers often contain behavior that seems 
anomalous. Sometimes the heuristic must pick from a set of choices that have 
the same local cost. Different choices can lead to different answers. 

In the allocator, we use a linear-time heuristic to find the coloring. From a 
decision tree perspective, it picks a path through a tree in time proportional to 


5 We have never seen it produce a worse allocation, though such situations are conceivable. It is 
often better to save an expensive spill now instead of possibly saving less-expensive spills in the 
future. 
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the height of the tree. Of course, the tree has an exponential number of nodes. 
At each step, it picks based on local information - the history of previous choices 
and the remaining nodes. Changing a local choice can lead to a different global 
result. Thus, simple things like the ordering of nodes, how ties are broken, and 
actual color assignments produce variations in the output. 

These variations make it difficult to compare two competing allocators. For 
example, in Table II, our rematerialization scheme produced worse code than 
the optimistic allocator with Chaitin’s scheme for two of the routines, colbur 
and ddef lu in doduc. A heuristic that, in general, has better global behavior, 
can produce anomalous results in specific cases. 

This effect may, in part, account for the effects seen by the Haifa group with 
their “best of three” spilling scheme [2] . They smooth out some of the NP-noise 
by using three different spill metrics and taking the best result. 

7. OTHER WORK 

The relationship between economical use of memory and graph coloring has been 
discussed for a long time. The first paper on this subject appears to be Lavrov’s 
1961 paper on minimizing memory use [28]. He suggests building an inconsistency 
graph and coloring it; he does not propose a practical method for finding a coloring. 
Ershov and his colleagues built on this work in the ALPHA project. They solved 
storage allocation problems by building an interference graph and using a packing 
algorithm on it [14, 15] . By the late sixties, Cocke was clearly talking about applying 
these insights directly to register allocation; both Kennedy and Schwartz credit him 
with this insight [26, 31]. 

This early work on graph coloring register allocation emphasizes the coloring 
problem with little consideration for the questions of spill choice and placement. 
Algorithms by Cocke and Ershov (as reported by Schwartz [31]) are concerned 
exclusively with minimizing the number of colors required. There is no discussion 
of spill code and the flow graph is ignored entirely. 

The first complete register allocator based upon graph coloring is described by 
Chaitin and his colleagues [8] . Spilling is handled by a variety of heuristics, some 
based upon an interval analysis of the flow graph. Unfortunately, these ad hoc 
techniques are expensive and not always effective. In a subsequent paper, Chaitin 
introduces a simpler technique that attempts to solve the spilling problem based 
on the interference graph and spill-cost estimates for each live range [6] . Section 2 
describes this allocator. 

Chow and Hennessy describe a priority-based coloring scheme [10, 11]. In their 
work, values initially reside in memory. They divide the register set between lo- 
cal and global allocation. They build an imprecise interference graph and color 
constrained live ranges in priority order. Their priority function uses spill cost 
normalized by live range length. When a constrained live range cannot be colored, 
they split it into smaller pieces and try to color them independently. 

Fabri, in the context of her work on minimizing memory storage requirements, 
introduces a renaming transformation that is analogous to Chow’s live range split- 
ting [16]. She notes that, in some cases, splitting improved the resulting coloring. 
She mentions the idea’s applicability to register allocation problems. 

Several groups have refined Chow’s approach. Larus and Hilfinger make several 
modifications [27]. In their work, values initially reside in registers. They elim- 
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mate local allocation, limit basic block sizes, and use a more sophisticated live 
range splitting technique. Gupta, Soffa, and Steele describe an approach that par- 
titions the interference graph into subgraphs that are colored individually and later 
merged [23]. 

Johnson and Miller describe another priority-based allocator [25]. They assign 
colors in order of decreasing degree in the interference graph. When they must spill, 
they examine both the current live range and all those that have been assigned 
colors and spill the value with minimal cost under their spill metric. Their metric 
combines Chaitin’s notion of spill cost with a term that includes the impact of 
keeping a value in a register across loops where it is live but not used. While we 
have not implemented a version of their algorithm, we believe that their scheme 
will avoid problems like the one we observed with the SVD routine. On the other 
hand, it will spill some live ranges that Chaitin proves are trivially colored. 

Bernstein and his colleagues describe a collection of techniques [2]. They show 
that a best-of-three coloring scheme produces better results than Chaitin’s original 
scheme. They impose a “largest degree first” ordering when removing unconstrained 
nodes from the graph in simplify (Step (5a) in Section 2). They introduce a cleaning 
heuristic to decrease the amount of spill code generated within a single basic block. 
In our experience, these improvements are orthogonal to those presented in this 
paper. 

Callahan and Koblenz construct a fine-grained hierarchical decomposition of the 
flow graph, a tiling [5]. Coloring is performed for individual tiles and the results 
are merged in two passes over the tree. Their algorithm is explicitly parallel. 

Of course, not all global allocators are built on the graph coloring paradigm. 
Other approaches include the bin-packing allocators built by Digital Equipment 
Corporation and the probabilistic allocator of Fischer and Proebsting [1, 17]. It is 
difficult to compare these techniques because the implementations work on different 
intermediate representations, they follow different optimizers, and the compilers 
target different machines. 


8. SUMMARY 

Optimistic coloring is a simple improvement over Chaitin’s original allocator. It 
produces the same allocation as Chaitin’s method, except when it improves on 
the result. The results in Table I show that this happens regularly. The costs 
are nearly identical; occasionally one method will require an extra trip around the 
main loop for spill code insertion. In another paper, we have shown that optimistic 
allocators generate better allocations on machines that use register pairs for some 
values [4] . Any implementation of a Chaitin-style allocator should incorporate this 
improvement; it makes a significant difference in the allocation quality at little or 
no cost. 

Our method of rematerializing never-killed values finds the maximum extents for 
each never-killed value. It ensures that the allocator will spill these values in the 
least expensive way. In procedures where never-killed values must be spilled, the 
result is better code. The results in Table II show that this actually happens. When 
no opportunities for rematerialization exist, the features added for rematerialization 
do not hurt. Additionally, the algorithms that we describe in Section 4 are both 
fast and practical. 
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